Abstract

We consider the problem of designing revenue maximizing online posted-price mechanisms when 
the seller has limited supply. A seller has k identical items for sale and is facing n potential buyers 
("agents") that are arriving sequentially. Each agent is interested in buying one item. Each agent's value 
for an item is an independent sample from some fixed (but unknown) distribution with support [0, 1]. 
The seller offers a take-it-or-leave-it price to each arriving agent (possibly different for different agents), 
and aims to maximize his expected revenue. 

We focus on mechanisms that do not use any information about the distribution; such mechanisms 
are called detail-free (or prior-independent). They are desirable because knowing the distribution is
unrealistic in many practical scenarios. We study how the revenue of such mechanisms compares to the 
revenue of the optimal offline mechanism that knows the distribution ("offline benchmark"). 

We present a detail-free online posted-price mechanism whose revenue is at most $O((k \log n)^{2/3})$
less than the offline benchmark, for every distribution that is regular. In fact, this guarantee holds without
any assumptions if the benchmark is relaxed to fixed-price mechanisms. Further, we prove a matching
lower bound. The performance guarantee for the same mechanism can be improved to $O(\sqrt{k} \log n)$, with
a distribution-dependent constant, if the ratio $k/n$ is sufficiently small. We show that, in the worst case over
all demand distributions, this is essentially the best rate that can be obtained with a distribution-specific
constant.

On a technical level, we exploit the connection to multi-armed bandits (MAB). While dynamic pric- 
ing with unlimited supply can easily be seen as an MAB problem, the intuition behind MAB approaches 
breaks when applied to the setting with limited supply. Our high-level conceptual contribution is that 
even the limited supply setting can be fruitfully treated as a bandit problem. 

Keywords: mechanism design; revenue maximization; posted price; multi-armed bandits; regret. 

1 Introduction 

Consider an airline that is interested in selling k tickets for a given flight. The seller is interested in maximiz- 
ing her revenue from selling these tickets, and is offering the tickets on a website such as Expedia. Potential 
buyers ("agents") arrive one after another, each with the goal of purchasing a ticket if the price is smaller
than the agent's valuation. The seller expects n such agents to arrive. Whenever an agent arrives the seller 
presents to him a take-it-or-leave-it price, and the agent makes a purchasing decision according to that price. 

* A preliminary version of this paper, titled "Detail-free, Posted-Price Mechanisms for Limited Supply Online Auctions", has
appeared in the Workshop on Bayesian Mechanism Design at ACM EC 2011. That version did not include the results in Section 6.
† Microsoft Research Silicon Valley, Mountain View CA, USA. Email: {moshe, slivkins}@microsoft.com.
‡ Microsoft Research, Redmond WA, USA. Email: shaddin@microsoft.com.
§ Department of Computer Science, Cornell University, Ithaca NY, USA. Email: rdk@cs.cornell.edu.
The seller can update the price taking into account the observed history and the number of remaining items
and agents. 

We adopt a Bayesian view that the valuations of the buyers are IID samples from a fixed distribution, 
called demand distribution. A standard assumption in a Bayesian setting is that the demand distribution is 
known to the seller, who can design a specific mechanism tailored to this knowledge. (For example, the 
Myerson optimal auction for one item sets a reserve price that is a function of the distribution). However, in 
some settings this assumption is very strong, and should be avoided if possible. For example, when the seller 
enters a new market, she might not know the demand distribution, and learning it through market research 
might be costly. Likewise, when the market has experienced a significant recent change, the new demand 
function might not be easily derived from the old data. 

Ideally we would like to design mechanisms that perform well for any demand distribution, and yet do 
not rely on knowing it. Such mechanisms are called detail-free¹ in the sense that the specification of the
mechanism does not depend on the details of the "environment", in the spirit of Wilson's Doctrine [39].
Learning about the demand distribution is an integral part of the problem that a detail-free mechanism faces.
The performance of such mechanisms is compared to a benchmark that does depend on the specific demand
distribution, as in [30, 27, 12, 23] and many other papers.

In this paper we take this approach and design detail-free, online posted-price mechanisms with revenue 
that is close to the revenue of the optimal offline mechanism (that can depend on the demand distribution 
and is not restricted to be posted price). Our main results are for any demand distribution that is regular, or 
any demand distribution that satisfies the stronger condition of "monotone hazard rate". Both conditions are 
mild and standard, and even the stronger one is satisfied by most common distributions, such as the normal, 
uniform, and exponential distributions. 

Posted price mechanisms are commonly used in practice, and are appealing for several reasons. First, an 
agent only needs to evaluate her offer rather than compute her private value exactly. Human agents tend to 
find the former task much easier than the latter. Second, agents do not reveal their entire private information 
to the seller: rather, they only reveal whether their private value is larger than the posted price. Third, posted- 
price mechanisms are truthful (in dominant strategies) and moreover also group strategy-proof (a notion of 
collusion resistance when side payments are not allowed). Further, detail-free posted-price mechanisms are 
particularly useful in practice as the seller is not required to estimate the demand distribution in advance. 
Similar arguments can be found in prior work, e.g. [20].



Our model. We consider the following limited supply auction model, which we term dynamic pricing with
limited supply. A seller has k items she can sell to a set of n agents (potential buyers), aiming to maximize 
her expected revenue. The agents arrive sequentially to the market and the seller interacts with each agent 
before observing future agents (in an online manner). We make the simplifying assumption that each agent 
interacts with the seller only once, and the timing of the interaction cannot be influenced by the agent. (This 
assumption is also made in other papers that consider our problem for special supply amounts [30, 7, 12].)
Each agent $i$ ($1 \le i \le n$) is interested in buying one item, and has a private value $v_i$ for an item. The
private values are independently drawn from the same demand distribution $F$. The demand distribution $F$
is unknown to the seller, but it is known that $F$ has support in $[0, 1]$.²

Whenever agent $i$ arrives to the market the seller offers him a price $p_i$ for an item. The agent buys
the item if and only if $v_i \ge p_i$, and in case he buys the item he pays $p_i$ (so the mechanism is
incentive-compatible). The seller never learns the exact value of $v_i$; she only observes the agent's binary decision
to buy the item or not. The seller selects prices $p_i$ using an online algorithm, which we henceforth call a
pricing strategy. We are interested in designing pricing strategies with high revenue compared to a natural
benchmark, with minimal assumptions on the demand distribution.

¹ An alternative term used to describe these mechanisms is prior-independent.

² Assuming that $\mathrm{support}(F) \subseteq [0, 1]$ is w.l.o.g. (by normalizing) as long as the seller knows an upper bound on the support.

Our main benchmark is the maximal expected revenue of an offline mechanism that is allowed to use 
the demand distribution; henceforth, we will call it offline benchmark. This is a very strong benchmark, as
it has the following advantages over our mechanism: it is allowed to use the demand distribution, it is not
constrained to posted prices and is not constrained to run online. It is realized by the well-known Myerson
Auction [35] (which does rely on knowing the demand distribution).

High-level discussion. Absent the supply constraint, our problem fits into the multi-armed bandit (MAB)
framework [18]: in each round, an algorithm chooses among a fixed set of alternatives ("arms") and observes
a payoff, and the objective is to maximize the total payoff over a given time horizon. Our setting corresponds
to (prior-free) MAB with stochastic payoffs [31, 4]: in each round, the payoff is an independent sample from
some unknown distribution that depends on the chosen "arm" (price). This connection is exploited in [30, 13]
for the special case of unlimited supply ($k = n$). The authors use a standard algorithm for MAB with
stochastic payoffs, called UCB1 [4]. Specifically, they focus on the prices $\{i\delta : i \in \mathbb{N}\}$, for some parameter
$\delta$, and run UCB1 with these prices as "arms". The analysis relies on the regret bound from [4].

However, neither the analysis nor the intuition behind UCB1 and similar MAB algorithms is directly
applicable for the setting with limited supply. Informally, the goal of an MAB algorithm would be to
converge to a price $p$ that maximizes the expected per-round revenue $R(p) = p(1 - F(p))$. This is, in
general, a wrong approach if the supply is limited: indeed, selling at a price that maximizes $R(\cdot)$ may
quickly exhaust the inventory, in which case a higher price would be more profitable.
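
The point can be checked numerically. The following minimal Python sketch is not from the paper: it assumes a uniform demand distribution on [0, 1], n = 1000 agents and k = 100 items, and approximates the expected total revenue of a fixed price p by p · min(k, n(1 − F(p))), i.e. expected sales capped by the supply. The function names and the price grid are illustrative assumptions.

```python
# Illustrative assumption: uniform demand on [0, 1], so F(p) = p.
def survival(p):
    """S(p) = 1 - F(p): probability that an agent accepts price p."""
    return 1.0 - p


def per_round_revenue(p):
    """R(p) = p * (1 - F(p)), the expected revenue from a single agent."""
    return p * survival(p)


def approx_total_revenue(p, n, k):
    """Approximate total revenue of fixed price p: expected sales capped by supply k."""
    return p * min(k, n * survival(p))


n, k = 1000, 100
grid = [i / 1000 for i in range(1001)]  # hypothetical price grid

p_round = max(grid, key=per_round_revenue)
p_total = max(grid, key=lambda p: approx_total_revenue(p, n, k))

print("price maximizing R(p):            ", p_round)
print("price maximizing capped revenue:  ", p_total)
print("capped revenue at these prices:   ",
      approx_total_revenue(p_round, n, k), approx_total_revenue(p_total, n, k))
```

Under these assumptions the per-round maximizer sells out long before the agents run out, while the higher price earns more per item from the same 100 items.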

Our high-level conceptual contribution is showing that even the limited supply setting can be fruitfully 
treated as a bandit problem. The MAB perspective here is that we focus on the trade-off between exploration 
(acquiring new information) and exploitation (taking advantage of the information available so far). In 
particular, we recover an essential feature of UCB1 that it does not separate exploration and exploitation, and
instead explores arms (prices) according to a schedule that unceasingly adapts to the observed payoffs. This
feature results, both for UCB1 and for our algorithm, in a much more efficient exploration of suboptimal
arms: very suboptimal arms are chosen very rarely even while they are being "explored".

We use an "index-based" algorithm where each arm is deterministically assigned a numerical score 
("index") based on the past history, and in each round an arm with a maximal index is chosen; the index of 
an arm depends on the past history of this arm (and not on other arms). One key idea is that we define the 
index of an arm according to the estimated expected total payoff from this arm given the known constraints,
rather than according to its estimated expected payoff in a single round. This idea leads to an algorithm that
is simple and (we believe) very natural. However, while the algorithm is simple its analysis is not: some new
ideas are needed, as the elegant tricks from prior work do not apply (see Section 4 for further discussion).

Contributions. In all results below, we consider the dynamic pricing problem with limited supply: n 
agents and $k \le n$ items. We present pricing strategies with expected revenue that is close to the offline
benchmark, for large families of natural distributions. All our pricing strategies are deterministic and (triv- 
ially) run in polynomial time. Our main result follows. 

Theorem 1.1. There exists a detail-free pricing strategy such that for any regular demand distribution its 
expected revenue is at least the offline benchmark minus $O((k \log n)^{2/3})$.

We emphasize that Theorem 1.1 holds for a pricing strategy that does not know the demand distribution.
The resulting mechanism is incentive-compatible as it is a posted price mechanism. The specific bound
$O((k \log n)^{2/3})$ is most informative when $k \gg \log n$, so that the dependence on $n$ is insignificant; the focus
here is to optimize the power of $k$. (Note that any non-trivial bound must be below $k$.)
The proof of Theorem 1.1 consists of two stages. The first stage (immediate from Yan [40]) is to
observe that for any regular demand distribution the expected revenue of the best fixed-price strategy³ is
close to the offline benchmark. Henceforth, the expected revenue of the best fixed-price strategy will be 
called the fixed-price benchmark. The second stage, which is our main technical contribution, is to show 
that our pricing strategy achieves expected revenue that is close to the fixed-price benchmark. Surprisingly, 
this holds without any assumptions on the demand distribution. 

Theorem 1.2. There exists a detail-free pricing strategy whose expected revenue is at least the fixed-price 
benchmark minus $O((k \log n)^{2/3})$. This result holds for every demand distribution. Moreover, this result is
the best possible up to a factor of $O(\log n)$.

As discussed above, we recover the MAB technique from for the unlimited supply setting. The 
corresponding contribution to the literature on MAB may be of independent interest. 

If the demand distribution is regular, and moreover the ratio $k/n$ is sufficiently small, then the guarantee in
Theorem 1.1 can be improved to $O(\sqrt{k} \log n)$, with a distribution-specific constant.

Theorem 1.3. There exists a detail-free pricing strategy whose expected revenue, for any regular demand 
distribution $F$, is at least the offline benchmark minus $O(c_F \sqrt{k} \log n)$ whenever $k/n \le s_F$, where $c_F$ and $s_F$
are positive constants that depend on $F$. For monotone hazard rate distributions one can take $s_F = \frac{1}{4}$.

The bound in Theorem 1.3 is achieved using the pricing strategy from Theorem 1.1 with a different
parameter. Varying this parameter, we obtain a family of strategies that improve over the bound in Theorem 1.1
in the "nice" setting of Theorem 1.3 and moreover have non-trivial additive guarantees for arbitrary
demand distributions. However, we cannot match both theorems with the same parameter.

Note that the rate-$\sqrt{k}$ dependence on $k$ in Theorem 1.3 contains a distribution-dependent constant $c_F$
(which can be arbitrarily large, depending on $F$), and thus is not directly comparable to the rate-$k^{2/3}$ dependence
in Theorem 1.2. The distinction (and a significant gap) between bounds with and without distribution-dependent
constants is not uncommon in the literature on sequential decision problems, e.g. in 0130112910

In fact, we show that the $c_F \sqrt{k}$ dependence on $k$ is essentially the best possible.⁵ We focus on the
fixed-price benchmark (which is a weaker benchmark, so it gives a stronger lower bound). Following the
literature, we define regret as the fixed-price benchmark minus the expected revenue of our pricing strategy.

Theorem 1.4. For any $\gamma < \frac{1}{2}$, no detail-free pricing strategy can achieve regret $O(c_F k^{\gamma})$ for all demand
distributions $F$ and arbitrarily large $k$, $n$, where the constant $c_F$ can depend on $F$.

The bounds in Theorem 1.1 and Theorem 1.2 are uninformative when $k = O(\log^2 n)$. We next provide
another detail-free, online posted-price mechanism that gives meaningful bounds - not depending on n - in 
the case that k is very small (but bigger than some constant). 

Theorem 1.5. There exists a detail-free pricing strategy such that for any MHR demand distribution its 
expected revenue is at least the offline benchmark minus $O(k^{3/4}\,\mathrm{polylog}(k))$.

³ A fixed-price strategy is a pricing strategy that offers the same price to all agents, as long as it has items to sell. The "best"
fixed-price strategy is one with the maximal expected revenue for a given demand distribution.

⁴ For a particularly pronounced example, for the $K$-armed bandit problem with stochastic payoffs the best possible rates for
regret with and without a distribution-dependent constant are respectively $O(c_F \log n)$ and $O(\sqrt{Kn})$ [4, 5, 3].

⁵ However, the lower bound in Theorem 1.4 does not match the upper bound in Theorem 1.3 since the latter assumes regularity.
2 Related Work



Dynamic pricing. Dynamic pricing problems and, more generally, revenue management problems, have 
a rich literature in Operations Research. A proper survey of this literature is beyond our scope; see [12] for
an overview. The main focus is on parameterized demand distributions, with priors on the parameters. 

The study of dynamic pricing with unknown demand distribution (without priors) has been initiated 
in [15, 30]. Several special cases of our setting have been studied in [30, 7, 12], detailed below.

First, Kleinberg and Leighton [30] consider the unlimited supply case (building on the earlier work [15]).
Among other results, they study IID valuations, i.e. our setting with $k = n$. They provide upper bounds
on regret of order $O(n^{2/3})$ and $O(c_F \sqrt{n})$.⁶ The latter bound is akin to Theorem 1.3 in that it assumes a
version of regularity, and depends on a distribution-specific constant $c_F$. Further, they prove matching lower
bounds which, in particular, imply Theorem 1.4 for the special case of unlimited supply.⁷

On the other extreme, Babaioff et al. [7] consider the case that the seller has only one item to sell ($k = 1$).
They provide a super-constant multiplicative lower bound for unrestricted demand distribution (with respect 
to the online optimal mechanism), and a constant-factor approximation assuming MHR. Note that we also 
use MHR to derive bounds that apply to the case of a very small k. 

Besbes and Zeevi [12] consider a continuous-time version which (when specialized to discrete time)
is essentially equivalent to our setting with $k = \Omega(n)$. They prove a number of upper bounds on regret
with respect to the fixed-price benchmark, with guarantees that are inferior to ours. The key distinction
is that their pricing strategies separate exploration and exploitation. Assuming that the demand distribution
$F(\cdot)$ and its inverse $F^{-1}(\cdot)$ are Lipschitz-continuous, they achieve regret $O(n^{3/4})$. They improve it
to $O(n^{2/3})$ if furthermore the demand distributions are parameterized, and to $O(\sqrt{n})$ if this is a single-parameter
parametrization. Both results rely on knowing the parametrization: the mechanisms continuously
update the estimates of the parameter(s) and revise the current price according to these estimates. The upper
bounds in [12] should be contrasted with our $O(k^{2/3})$ upper bound that applies to an arbitrary $k$ and makes
no assumptions on the demand distribution, and the $O(c_F \sqrt{k})$ improvement for MHR demand distributions.

Also, [12] contains an $\Omega(\sqrt{n})$ lower bound for their notion of regret. Essentially, this lower bound compares
the best pricing strategy for a given demand distribution to the best (distribution-dependent) pricing
strategy for a fictitious environment where in every round the mechanism sells a fractional amount of good. 
In particular, this lower bound does not have any immediate implications on regret with respect to either of 
the two benchmarks that we use in this paper. 

Online mechanisms. The study of online mechanisms was initiated by Lavi and Nisan [32], who unlike
us consider the case that each agent is interested in multiple items, and provide a logarithmic multiplicative 
approximation. Below we survey only the most relevant papers in this line of work, in addition to the special 
cases of our setting that we have already discussed. 

Several papers [10, 15, 30, 14] consider online mechanisms with unlimited supply and adversarial valuations
(as opposed to limited supply and IID valuations in our setting). The mechanism in the initial paper [10]
requires the agents to submit bids and so is not posted-price. The subsequent work [15, 30, 14] provides
various improvements. In particular, Blum et al. [15] (among other results) design a simple posted-price
mechanism which achieves multiplicative approximation $1 + \epsilon$, for any $\epsilon > 0$, with an additive term that
depends on $\epsilon$.⁸ Blum and Hartline [14] use a more elaborate posted-price mechanism to improve the additive
term. Kleinberg and Leighton [30] show that the simple mechanism in [15] achieves regret $O(n^{2/3})$;
moreover, they provide a nearly matching lower bound of $\Omega(n^{2/3})$.

⁶ Throughout this section, we omit the log factors in regret bounds.

⁷ The construction in [30] that proves Theorem 1.4(a) for the unlimited supply case is contained in the proof of a theorem on
adversarial valuations, but the construction itself only uses IID valuations.

⁸ This result considers valuations in the range $[1, H]$, and the additive term also depends on $H$.

Papers [26, 21] study online mechanisms for limited supply and IID valuations (same as us), but their
mechanisms are not posted-price. Hajiaghayi et al. [26] consider an online auction model where players
arrive and depart online, and may misreport the time period during which they participate in the auction. This
makes designing strategy-proof mechanisms more challenging, and as a result their mechanisms achieve a
constant multiplicative approximation rather than additive regret. Devanur and Hartline [21] study several
variants of the limited-supply mechanism design problem: supply is known or unknown, online or offline. 
Most related to our paper is their mechanism for limited, known, online supply. This mechanism is based 
on random sampling and achieves constant (multiplicative) approximation, but is not posted-price. Our 
mechanism is posted-price and achieves low (additive) regret. 

Other work. Absent the supply constraint, our problem (and a number of related formulations) fit into 
the multi-armed bandit (MAB) framework.⁹ MAB has a rich literature in Statistics, Operations Research,
Computer Science and Economics. A proper discussion of this literature is beyond the scope of this paper;
a reader can refer to [18, 11] for background. Most relevant to our specific setting is the work on (prior-free)
MAB with stochastic payoffs, e.g. [31, 4], and MAB with Lipschitz-continuous stochastic payoffs,
e.g. [2, 28, 6, 29, 17]. The posted-price mechanisms in [15, 30, 14] described above are based on a well-known
MAB algorithm [5] for adversarial payoffs. The connection between online learning and online
mechanisms has been explored in a number of other papers, including [36, 22, 9, 8].

Recently, [20, 19, 40] studied the problem of designing offline, sequential posted-price mechanisms
in Bayesian settings, where the distributions of valuations are not necessarily identical, yet are known to the
seller. Chawla et al. [20] provide constant multiplicative approximations. Yan [40] obtains a multiplicative
bound that is optimal for large $k$, and Chakraborty et al. [19] obtain a PTAS for all $k$.

3 Preliminaries 

Throughout, we assume that agents' valuations are drawn independently from a distribution F with support 
in $[0, 1]$, called demand distribution. We use $p \in [0, 1]$ to denote a price. We let $F(p)$ denote the c.d.f.,
and $S(p) = 1 - F(p)$ denote the survival rate at price $p$. Let $R(p) = p\,S(p)$ denote the revenue function:
the expected single-round revenue at price $p$ given that there is still at least one item left. The demand
distribution $F$ is called regular if $F(\cdot)$ is twice differentiable and the revenue function $R(\cdot)$ is concave:
$R''(\cdot) \le 0$. We call $F$ strictly regular if furthermore $R''(\cdot) < 0$. Then $R(p)$ is increasing for $p \le p_r$ and
decreasing for $p \ge p_r$, where $p_r$ is the unique maximizer, known as the Myerson reserve price. Moreover,
the survival rate $S(\cdot)$ is strictly decreasing, so the inverse $S^{-1}(\cdot)$ is well-defined. We say $F$ is a Monotone
Hazard Rate (MHR) distribution if $F(\cdot)$ is twice differentiable and the hazard rate $H(p) = F'(p)/S(p)$ is
non-decreasing. All MHR distributions are regular.
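
As a quick sanity check of these definitions, the sketch below (not from the paper) tests concavity of the revenue function and monotonicity of the hazard rate numerically for one assumed example, the uniform distribution on [0, 1]. The finite-difference helpers, step sizes and test points are illustrative choices, and a numerical check of this kind is not a proof.

```python
# Assumed example distribution: uniform on [0, 1].
def F(p):
    """c.d.f. of the uniform distribution on [0, 1]."""
    return p


def S(p):
    """Survival rate S(p) = 1 - F(p)."""
    return 1.0 - F(p)


def R(p):
    """Revenue function R(p) = p * S(p)."""
    return p * S(p)


def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def hazard_rate(p, h=1e-6):
    """H(p) = F'(p) / S(p), with F' estimated by a central difference."""
    density = (F(p + h) - F(p - h)) / (2.0 * h)
    return density / S(p)


points = [0.1 * i for i in range(1, 10)]  # interior test points

# Regularity as defined above: R'' <= 0 (small tolerance for rounding).
is_concave = all(second_derivative(R, p) <= 1e-6 for p in points)

# MHR: the hazard rate is non-decreasing along the test points.
hazards = [hazard_rate(p) for p in points]
is_mhr = all(a <= b + 1e-9 for a, b in zip(hazards, hazards[1:]))

print("revenue function concave on test points:", is_concave)
print("hazard rate non-decreasing on test points:", is_mhr)
```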

A fixed-price strategy with $n$ agents, $k$ items and price $p$, denoted $\mathcal{A}_k^n(p)$, is a pricing strategy that
makes a fixed offer price $p$ to every agent so long as fewer than $k$ items have been sold, and stops afterwards
(equivalently, from that point always sets the price to $\infty$). Note that for the unlimited supply case $\mathcal{A}_n^n(p)$
sells $n\,S(p)$ items in expectation.

A pricing strategy is called detail-free if it does not use the knowledge of the demand distribution. We are 
interested in designing detail-free pricing strategies with good performance for every demand distribution 
in some (large) family of distributions. We compare our mechanisms to two benchmarks that depend on
the demand distribution: the maximal expected revenue of an offline mechanism (the offline benchmark),
and the maximal expected revenue of a fixed price mechanism (the fixed-price benchmark). An offline
mechanism that maximizes expected revenue was given in the seminal paper of Myerson [35]; it is not an
online posted price mechanism.

⁹ To avoid possible confusion, we note that the supply constraint in our setting may appear similar to the budget constraint in the line
of work on budgeted MAB (see [16, 25] for details and further references). However, the "budget" in budgeted MAB is essentially
the duration of the experimentation phase ($n$), rather than the number of rounds with positive reward ($k$).

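Let $\mathrm{Rev}(\mathcal{A})$ be the total expected revenue achieved by mechanism $\mathcal{A}$. We define the regret of $\mathcal{A}$ with
respect to the fixed-price benchmark as follows: $\mathrm{Regret}(\mathcal{A}) = \max_p \mathrm{Rev}[\mathcal{A}_k^n(p)] - \mathrm{Rev}(\mathcal{A})$. Thus, regret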
is the additive loss in expected revenue compared to the best fixed-price mechanism. (Note that the regret of 
A could, in principle, be a negative number, since the fixed-price benchmark is not generally the Bayesian 
optimal pricing strategy for distribution F.) 



Benchmarks Comparison. We observe that for regular demand distributions, the fixed-price benchmark 
is close to the offline benchmark. This result is immediate from Yan [40]; we provide a self-contained proof
in Appendix El 

Lemma 3.1 (Yan [40]). For each regular demand distribution there exists a fixed-price strategy whose
expected revenue is at least the offline benchmark minus 

Lemma 3.1 implies that any pricing strategy with regret $O(R)$, $R = \Omega(\sqrt{k})$, with respect to the fixed-price
benchmark has the same asymptotic regret $O(R)$ with respect to the offline benchmark, as long as
the demand distribution is regular, and in particular if it is MHR. Therefore, the rest of the paper can focus
on the fixed-price benchmark. In particular, our main result, Theorem 1.1 for regular distributions, follows
from Theorem 1.2 that addresses the fixed-price benchmark.

Furthermore, the expected revenue of a fixed-price mechanism has an easy characterization: 

Claim 3.2. Let $\mathcal{A}$ be the fixed-price mechanism with price $p$. Let $\nu(p) = p \min(k, n\,S(p))$. Then

$\nu(p) - O(p \sqrt{k \log k}) \le \mathrm{Rev}(\mathcal{A}) \le \nu(p)$. (1)

It follows that for a strictly regular demand distribution the bound in Lemma 3.1 is satisfied for the fixed
price $p^* = \arg\max_p \nu(p) = \max(p_r, S^{-1}(k/n))$, where $p_r = \arg\max_p p\,S(p)$ is the Myerson reserve price.

Proof. Let us focus on the first inequality in (1) (the second one is obvious). Let $X_t$ be the indicator variable
of sale in round $t$. Denote $X = \sum_{t=1}^{n} X_t$ and let $\mu = \mathbb{E}[X]$. Then by Chernoff Bounds (Theorem 4.7(a))
with probability at least $1 - \frac{1}{k}$ it holds that $X \ge \mu - O(\sqrt{\mu \log k})$, in which case

$\#\text{sales} = \min(k, X) \ge \min(k, \mu - O(\sqrt{\mu \log k})) \ge \min(k, \mu) - O(\sqrt{k \log k})$,
which implies the claim since $\mu = n\,S(p)$. □
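
The sandwich in Claim 3.2 can be checked exactly on a small instance. The sketch below is an illustration, not part of the paper: it assumes a uniform demand distribution, computes the expected revenue of a fixed price as p · E[min(k, X)] with X ~ Binomial(n, S(p)), and compares it with ν(p) = p · min(k, n S(p)). The helper names and the instance (n = 1000, k = 100, p = 0.9) are hypothetical choices.

```python
import math


def expected_sales(n, k, s):
    """E[min(k, X)] for X ~ Binomial(n, s), assuming 0 < s < 1.

    The binomial probabilities are evaluated in log space to avoid overflow.
    """
    total = 0.0
    for j in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
                   + j * math.log(s) + (n - j) * math.log(1.0 - s))
        total += min(k, j) * math.exp(log_pmf)
    return total


def fixed_price_revenue(p, n, k, survival):
    """Exact expected revenue of offering price p until k items are sold."""
    return p * expected_sales(n, k, survival(p))


def nu(p, n, k, survival):
    """The approximation nu(p) = p * min(k, n * S(p))."""
    return p * min(k, n * survival(p))


def uniform_survival(p):
    """Assumed example: uniform demand on [0, 1]."""
    return 1.0 - p


n, k, p = 1000, 100, 0.9
rev = fixed_price_revenue(p, n, k, uniform_survival)
upper = nu(p, n, k, uniform_survival)
gap = upper - rev

print("exact revenue:", rev)
print("nu(p):        ", upper)
print("gap:          ", gap, "vs p*sqrt(k log k) =", p * math.sqrt(k * math.log(k)))
```

On this instance the gap is positive and well below p√(k log k), in line with inequality (1); the hidden constant in the O(·) term is not pinned down by this check.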



4 The main technical result: the upper bound in Theorem 1.2



This section is devoted to the main technical result (the upper bound in Theorem 1.2) which asserts that
there exists a detail-free pricing strategy whose regret with respect to the fixed-price benchmark is at most
$O((k \log n)^{2/3})$. This result is very general, as it makes no assumptions on the demand distribution.

As discussed in Section 1, we design an algorithm that carefully optimizes the trade-off between exploration
and exploitation. We use an index-based algorithm in which each arm is assigned a numerical score,
called index, so that in each round an arm with the highest index is picked. The index of an arm depends 
only on the past history of this arm. In prior work on index-based bandit algorithms the index of an arm was 
defined according to estimated expected payoff from this arm in a single round. Instead, we define the index
according to estimated expected total payoff from this arm given the constraints. 

We apply the above idea to UCB1. The index in UCB1 is, essentially, the best available Upper Confidence
Bound (UCB) on the expected single-round payoff from a given arm. Accordingly, we define a new index,
so that the index of a given price corresponds to a UCB on the expected total payoff from this price (i.e.,
from a fixed-price strategy with this price), given the number of agents and the inventory size. Such an index
takes into account both the average payoff from this arm ("exploitation") and the number of samples for
this arm ("exploration"), as well as the supply constraint. In particular, we recover the appealing property of
UCB1 that it does not separate "exploration" and "exploitation", and instead explores arms (prices) according
to a schedule that unceasingly adapts to the observed payoffs.

There are several steps to make this approach more precise. First, while it is tempting to use the current 
values for the number of agents and the inventory size to define the index, we adopt a non-obvious (but more 
elegant) design choice to use the original values, i.e. the n and the k. Second, since the exact expected total 
payoff for a given price is hard to quantify, we will instead use a natural approximation thereof provided by 
$\nu(p)$ in Claim 3.2. In other words, our index will be a UCB on $\nu(p)$. Third, in specifying the UCB we will
use a non-standard estimator from [29] to better handle prices with very low survival rate.

The main technical hurdle in the analysis is to "charge" each suboptimal price for each time that it is 
chosen, in a way that the total regret is bounded by the sum of these charges and this sum can be usefully 
bounded from above. The analysis of UCBl accomplishes this via simple (but very elegant) tricks which, 
unfortunately, fail in the limited supply setting. 

An additional difficulty comes from the probabilistic nature of the analysis. While we adopt a well- 
known trick - we define some high-probability events and assume that these events hold deterministically 
in the rest of the analysis - choosing an appropriate collection of events is, in our case, non-trivial. Proving 
that these events indeed hold with high probability relies on some non-standard tail bounds from prior work.



4.1 Our pricing strategy 

Let us define our pricing strategy, called CappedUCB. The pricing strategy is initialized with a set $\mathcal{P}$ of
"active prices". In each round $t$, some price $p \in \mathcal{P}$ is chosen. Namely, for each price $p \in \mathcal{P}$ we define a
numerical score, called index, and we pick a price with the highest index, breaking ties arbitrarily. Once $k$
items are sold, CappedUCB sets the price to $\infty$ and never sells any additional item.

Recall from Claim 3.2 that the expected revenue from the fixed-price strategy $\mathcal{A}_k^n(p)$ is approximated by
$\nu(p) = p \cdot \min(k, n\,S(p))$. In each round $t$, we define the index $I_t(p)$ as a UCB on $\nu(p)$:

$I_t(p) = p \cdot \min(k, n\,S_t^{\mathrm{UB}}(p))$.

Here $S_t^{\mathrm{UB}}(p)$ is a UCB on the survival rate $S(p)$, as defined below.

For each $p \in \mathcal{P}$, let $N_t(p)$ be the number of rounds before $t$ in which price $p$ has been chosen, and
let $k_t(p)$ be the number of items sold in these rounds. Then $\hat{S}_t(p) = k_t(p)/N_t(p)$ is the current average
survival rate. To avoid division by zero, we define $\hat{S}_t(p)$ to be equal to 1 when $N_t(p) = 0$. We will define
$S_t^{\mathrm{UB}}(p) = \hat{S}_t(p) + r_t(p)$, where $r_t(p)$ is a confidence radius: some number such that

$|S(p) - \hat{S}_t(p)| \le r_t(p)$ $(\forall p \in \mathcal{P},\ t \le n)$ (2)

holds with high probabiUty, namely with probability at least 1 — n~^. 

We need to define a suitable confidence radius $r_t(p)$, which we want to be as small as possible subject
to (2). Note that $r_t(p)$ must be defined in terms of quantities that are observable at time $t$, such as $N_t(p)$ and
$S_t(p)$. A standard confidence radius used in the literature is (essentially) $r_t(p) = \sqrt{\Theta(\log n)/(N_t(p)+1)}$.
Instead, we use a more elaborate confidence radius from [29]:

\[ r_t(p) = \frac{\alpha}{N_t(p)+1} + \sqrt{\frac{\alpha\,S_t(p)}{N_t(p)+1}}, \quad \text{for some } \alpha = \Theta(\log n). \tag{3} \]




Putting the pieces together, the index is

\[ I_t(p) = p \cdot \min\bigl(k,\; n\,(S_t(p) + r_t(p))\bigr), \quad \text{where } r_t(p) \text{ is from (3)}. \tag{4} \]



Finally, the active prices are given by

\[ \mathcal{P} = \{\delta(1+\delta)^i \in [0,1] : i \in \mathbb{N}\}, \quad \text{where } \delta \in (0,1) \text{ is a parameter}. \tag{5} \]



This completes the specification of CappedUCB. See Mechanism 1 for the pseudocode.

Mechanism 1 Pricing strategy CappedUCB for $n$ agents and $k$ items
Parameter: $\delta \in (0, 1)$

1: $\mathcal{P} \leftarrow \{\delta(1+\delta)^i \in [0,1] : i \in \mathbb{N}\}$ {"active prices"}
2: While there is at least one item left, in each round $t$
     pick any price $p \in \arg\max_{p \in \mathcal{P}} I_t(p)$, where $I_t(p)$ is the "index" given by (4).
3: For all remaining agents, set price $p = \infty$.
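The pseudocode above can be sketched in Python as follows. This is an illustrative simulation, not the authors' implementation: the function names, the convention that an agent buys when her value is at least the posted price, the tie-breaking rule (first maximizer), and the default $\alpha = \log n$ are all assumptions.

```python
import math
import random

def active_prices(delta):
    """Prices delta * (1 + delta)**i that lie in [0, 1], as in Eq. (5)."""
    prices, p = [], delta
    while p <= 1.0:
        prices.append(p)
        p *= 1.0 + delta
    return prices

def capped_ucb(values, k, delta, alpha=None):
    """Run CappedUCB on a list of agent values; return (revenue, items_sold)."""
    n = len(values)
    alpha = math.log(n) if alpha is None else alpha  # assumed constant in Theta(log n)
    prices = active_prices(delta)
    plays = [0] * len(prices)   # N_t(p)
    sales = [0] * len(prices)   # k_t(p)
    revenue, sold = 0.0, 0
    for v in values:
        if sold >= k:
            break  # price is set to infinity: no further sales
        best_j, best_val = 0, -1.0
        for j, p in enumerate(prices):
            denom = plays[j] + 1
            s = sales[j] / plays[j] if plays[j] else 1.0
            r = alpha / denom + math.sqrt(alpha * s / denom)   # Eq. (3)
            val = p * min(k, n * (s + r))                      # Eq. (4)
            if val > best_val:
                best_j, best_val = j, val
        plays[best_j] += 1
        if v >= prices[best_j]:  # assumed: the agent buys iff value >= price
            sales[best_j] += 1
            sold += 1
            revenue += prices[best_j]
    return revenue, sold

def simulate_uniform(n, k, delta, seed=0):
    """Hypothetical helper: values drawn i.i.d. uniform on [0, 1]."""
    rng = random.Random(seed)
    return capped_ucb([rng.random() for _ in range(n)], k, delta)

if __name__ == "__main__":
    print(simulate_uniform(n=2000, k=100, delta=0.2))
```

Note that the cap $\min(k, \cdot)$ makes the index proportional to the price as long as the optimistic sales estimate exceeds $k$, which is why the strategy starts from high prices when supply is scarce.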



4.2 Analysis of the pricing strategy 

Our goal is to bound from above the regret of CappedUCB, which is the difference between the optimal expected
revenue of a fixed-price strategy and the expected revenue of CappedUCB. We prove that CappedUCB
achieves regret $O((k \log n)^{2/3})$ for a suitable choice of parameter $\delta$ in (5).

Lemma 4.1. CappedUCB with parameter $\delta = k^{-1/3} (\log n)^{2/3}$ achieves regret $O((k \log n)^{2/3})$.

Since the bound in Lemma 4.1 is trivial for $k < \log^2 n$, we will assume that $k \ge \log^2 n$ from now on.

Note that CappedUCB "exits" (sets the price to $\infty$) after it sells $k$ items. For a thought experiment,
consider a version of this pricing strategy that does not "exit" and continues running as if it has unlimited 
supply of items; let us call this version CappedUCB'. Then the realized revenue of CappedUCB is exactly 
equal to the realized revenue obtained by CappedUCB' from selling the first k items. Thus from here on we 
focus on analyzing the latter. 

We will use the following notation. Let $X_t$ be the indicator variable of the random event that CappedUCB'
makes a sale in round $t$. Note that $X_t$ is a 0-1 random variable with expectation $S(p_t)$, where $p_t$ depends
on $X_1, \ldots, X_{t-1}$. Let $X = \sum_{t=1}^n X_t$ be the total number of sales if the inventory were unlimited. Note
that $E[X] = S = \sum_{t=1}^n S(p_t)$. Going back to our original algorithm, let $\widehat{\mathrm{Rev}}$ denote the realized revenue of
CappedUCB (revenue that is realized in a given execution). Then

\[ \widehat{\mathrm{Rev}} = \sum_{t=1}^{N} p_t X_t, \quad \text{where } N = \max\Bigl\{N \le n : \sum_{t=1}^{N} X_t \le k\Bigr\}. \tag{6} \]

High-probability events. We tame the randomness inherent in the sales $X_t$ by setting up three high-probability
events, as described below. In the rest of the analysis, we will argue deterministically under
the assumption that these three events hold. This suffices because the expected loss in revenue from the
low-probability failure events will be negligible. The three events are summarized in the following claim:

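Claim 4.2. With probability at least $1 - n^{-2}$, the following holds for each round $t$ and each price $p \in \mathcal{P}$:

\[ |S(p) - S_t(p)| \le r_t(p) \le 3\left(\frac{\alpha}{N_t(p)+1} + \sqrt{\frac{\alpha\,S(p)}{N_t(p)+1}}\right), \tag{7} \]
\[ |X - S| \le O\bigl(\sqrt{S \log n} + \log n\bigr), \tag{8} \]
\[ \Bigl|\sum_{t=1}^n p_t\,(X_t - S(p_t))\Bigr| \le O\bigl(\sqrt{S \log n} + \log n\bigr). \tag{9} \]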

The probability bounds on the three events in Claim 4.2 are derived via appropriate concentration inequalities,
some of which are non-standard; see Section 4.3 for further discussion. In the first event, the left
inequality asserts that $r_t(p)$ is a confidence radius, and the right inequality gives the performance guarantee
for it. The other two events focus on CappedUCB', and bound the deviation of the total number of sales ($X$)
and the realized revenue ($\sum_{t=1}^n p_t X_t$) from their respective expectations; importantly, these bounds are in
terms of $\sqrt{S}$ rather than $\sqrt{n}$.

In the rest of the analysis we will assume that the three events in Claim 4.2 hold deterministically.



Single-round analysis. Let us analyze what happens in a particular round $t$ of the pricing strategy. Let $p_t$
be the price chosen in round $t$. Let $p^*_{\mathrm{act}} \in \arg\max_{p \in \mathcal{P}} \nu(p)$ be the best active price according to $\nu(\cdot)$, and let
$\nu^*_{\mathrm{act}} = \nu(p^*_{\mathrm{act}})$. Let $\Delta(p) = \max\bigl(0,\ \tfrac{1}{n}\nu^*_{\mathrm{act}} - p\,S(p)\bigr)$ be our notion of "badness" of price $p$, compared to the
optimal approximate revenue $\nu^*_{\mathrm{act}}$. We will use this notation throughout the analysis, and eventually we will
bound regret in terms of $\sum_{p \in \mathcal{P}} \Delta(p)\,N(p)$, where $N(p)$ is the total number of times price $p$ is chosen.

Claim 4.3. For each price $p \in \mathcal{P}$ it holds that

\[ N(p)\,\Delta(p) \le O(\log n)\left(1 + \frac{k}{n}\,\frac{1}{\Delta(p)}\right). \tag{10} \]

Proof. By definition (2) of the confidence radius, for each price $p \in \mathcal{P}$ and each round $t$ we have

\[ \nu(p) \le I_t(p) \le p \cdot \min\bigl(k,\ n\,(S(p) + 2 r_t(p))\bigr). \tag{11} \]

Let us use this to connect each choice $p_t$ with $\nu^*_{\mathrm{act}}$:

\[ I_t(p_t) \ge I_t(p^*_{\mathrm{act}}) \ge \nu(p^*_{\mathrm{act}}) = \nu^*_{\mathrm{act}}, \qquad I_t(p_t) \le p_t \cdot \min\bigl(k,\ n\,(S(p_t) + 2 r_t(p_t))\bigr). \]

Combining these two inequalities, we obtain the key inequality:

\[ \tfrac{1}{n}\,\nu^*_{\mathrm{act}} \le p_t \cdot \min\bigl(\tfrac{k}{n},\ S(p_t) + 2\,r_t(p_t)\bigr). \tag{12} \]

There are several consequences for $p_t$ and $\Delta(p_t)$:

\[ \begin{aligned} p_t &\ge \tfrac{1}{k}\,\nu^*_{\mathrm{act}}, \\ \Delta(p_t) &\le 2\,p_t\,r_t(p_t), \\ \Delta(p_t) > 0 &\ \Rightarrow\ S(p_t) < \tfrac{k}{n}. \end{aligned} \tag{13} \]

The first two lines in (13) follow immediately from (12). To obtain the third line, note that $\Delta(p_t) > 0$
implies $p_t k \ge \nu^*_{\mathrm{act}} > n\,p_t\,S(p_t)$, which in turn implies $S(p_t) < \tfrac{k}{n}$.

Note that we have not yet used the definition (3) of the confidence radius. For each price $p = p_t$, let $t$ be
the last round in which this price has been selected by the pricing strategy. Note that $N(p)$ (the total number
of times price $p$ is chosen) is equal to $N_t(p) + 1$. Then using the second line in (13) to bound $\Delta(p)$, Eq. (7)
to bound the confidence radius $r_t(p)$, and the third line in (13) to bound the survival rate, we obtain:

\[ \Delta(p) \le O(p) \times \max\left(\frac{\alpha}{N(p)},\ \sqrt{\frac{k}{n}\,\frac{\alpha}{N(p)}}\right). \]

Rearranging the terms, we can bound $N(p)$ in terms of $\Delta(p)$ and obtain (10). □

Analyzing the total revenue. A key step is the following claim that allows us to consider $\sum_{t=1}^n p_t\,S(p_t)$
instead of the realized revenue $\widehat{\mathrm{Rev}}$, effectively ignoring the capacity constraint. This is where we use the
high-probability events (8) and (9). For brevity, let us denote $\beta(S) = O(\sqrt{S \log n} + \log n)$.

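Claim 4.4. $\widehat{\mathrm{Rev}} \ge \min\bigl(\nu^*_{\mathrm{act}},\ \sum_{t=1}^n p_t\,S(p_t)\bigr) - \beta(k)$.

Proof. Recall that $p_t \ge \tfrac{1}{k}\nu^*_{\mathrm{act}}$ by (13). It follows that $\widehat{\mathrm{Rev}} \ge \nu^*_{\mathrm{act}}$ whenever $\sum_{t=1}^n X_t > k$. Therefore, if
$\widehat{\mathrm{Rev}} < \nu^*_{\mathrm{act}}$ then $\sum_{t=1}^n X_t \le k$ and so $\widehat{\mathrm{Rev}} = \sum_{t=1}^n p_t X_t$. Thus, by (9) it holds that

\[ \widehat{\mathrm{Rev}} \ge \min\Bigl(\nu^*_{\mathrm{act}},\ \sum_{t=1}^n p_t X_t\Bigr) \ge \min\Bigl(\nu^*_{\mathrm{act}},\ \sum_{t=1}^n p_t\,S(p_t) - \beta(S)\Bigr). \]

So the claim holds when $S \le k$. On the other hand, if $S > k$ then by (8) it holds that

\[ X \ge S - \beta(S) \ge k - \beta(k), \]
\[ \widehat{\mathrm{Rev}} \ge \min(k, X)\,\bigl(\tfrac{1}{k}\,\nu^*_{\mathrm{act}}\bigr) \ge \nu^*_{\mathrm{act}} - \beta(k). \quad □ \]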

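In light of Claim 4.4 we can now focus on $\sum_{t=1}^n p_t\,S(p_t)$:

\[ \begin{aligned} \sum_{t=1}^n p_t\,S(p_t) &\ge \sum_{t=1}^n \bigl[\tfrac{1}{n}\,\nu^*_{\mathrm{act}} - \Delta(p_t)\bigr] \\ &= \nu^*_{\mathrm{act}} - \sum_{t=1}^n \Delta(p_t) \\ &= \nu^*_{\mathrm{act}} - \sum_{p \in \mathcal{P}} \Delta(p)\,N(p). \end{aligned} \tag{14} \]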
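Fix a parameter $\epsilon > 0$ to be specified later, and denote

\[ \mathcal{P}_{\mathrm{sel}} = \{p \in \mathcal{P} : N(p) \ge 1\}, \qquad \mathcal{P}_\epsilon = \{p \in \mathcal{P}_{\mathrm{sel}} : \Delta(p) \ge \epsilon\} \]

to be, respectively, the set of prices that have been selected at least once and the set of prices of badness
at least $\epsilon$ that have been selected at least once. Plugging (10) into (14), we obtain

\[ \begin{aligned} \sum_{p \in \mathcal{P}} \Delta(p)\,N(p) &\le \epsilon n + O(\log n) \sum_{p \in \mathcal{P}_\epsilon} \Bigl(1 + \frac{k}{n}\,\frac{1}{\Delta(p)}\Bigr) \\ &\le \epsilon n + O(\log n) \Bigl(|\mathcal{P}_\epsilon| + \frac{k}{n} \sum_{p \in \mathcal{P}_\epsilon} \frac{1}{\Delta(p)}\Bigr). \end{aligned} \tag{15} \]

Combining (14), (15) and Claim 4.4 yields a claim that summarizes our findings so far.

Claim 4.5. For any set $\mathcal{P}$ of active prices and any parameter $\epsilon > 0$ it holds that

\[ \nu^*_{\mathrm{act}} - E[\widehat{\mathrm{Rev}}] \le \epsilon n + O(\log n) \Bigl(|\mathcal{P}_\epsilon| + \frac{k}{n} \sum_{p \in \mathcal{P}_\epsilon} \frac{1}{\Delta(p)}\Bigr) + \beta(k). \]

Interestingly, this claim holds for any set of active prices. The following claim, however, takes advantage
of the fact that the active prices are given by (5).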

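Claim 4.6. $\nu^*_{\mathrm{act}} \ge \nu^* - \delta k$, where $\nu^* = \max_p \nu(p)$.

Proof. Let $p^* \in \arg\max_p \nu(p)$ denote the best fixed price with respect to $\nu(\cdot)$, ties broken arbitrarily. If
$p^* \le \delta$ then $\nu^* \le \delta k$. Else, letting $p_0 = \max\{p \in \mathcal{P} : p \le p^*\}$ we have $p_0/p^* \ge \frac{1}{1+\delta} \ge 1 - \delta$, and so

\[ \nu^*_{\mathrm{act}} \ge \nu(p_0) \ge \frac{p_0}{p^*}\,\nu(p^*) \ge \nu^*(1 - \delta) \ge \nu^* - \delta k. \quad □ \]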
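It follows that for any $\epsilon > 0$ and $\delta \in (0, 1)$ we have:

\[ \mathrm{Regret} \le O(\log n) \Bigl(|\mathcal{P}_\epsilon| + \frac{k}{n} \sum_{p \in \mathcal{P}_\epsilon} \frac{1}{\Delta(p)}\Bigr) + \epsilon n + \delta k + \beta(k). \tag{16} \]

The rest is a standard computation. Plugging in $\Delta(p) \ge \epsilon$ for each $p \in \mathcal{P}_\epsilon$ in (16), we obtain:

\[ \mathrm{Regret} \le O(|\mathcal{P}_\epsilon| \log n) \Bigl(1 + \frac{k}{n}\,\frac{1}{\epsilon}\Bigr) + \epsilon n + \delta k + \beta(k). \]

Note that $|\mathcal{P}| \le \frac{1}{\delta} \log n$. To simplify the computation, we will assume that $\delta \ge \frac{1}{n}$ and $\epsilon = \delta\,\frac{k}{n}$. Then

\[ \mathrm{Regret} \le O\Bigl(\delta k + \frac{1}{\delta^2} (\log n)^2 + \sqrt{k \log n}\Bigr). \tag{17} \]

Finally, it remains to pick $\delta$ to minimize the right-hand side of (17). Let us simply take $\delta$ such that the first
two summands are equal: $\delta = k^{-1/3} (\log n)^{2/3}$. Then the two summands are equal to $O((k \log n)^{2/3})$. This
completes the proof of Lemma 4.1.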
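The balancing step can be checked numerically. The sketch below evaluates the three summands in (17) with the constant hidden in $O(\cdot)$ taken to be 1, which is an assumption for illustration, and confirms that the stated $\delta$ equalizes the first two; the function names are hypothetical.

```python
import math

def regret_terms(k, n, delta):
    """The three summands of Eq. (17), with the O(.) constant assumed to be 1."""
    log_n = math.log(n)
    return (delta * k, (log_n ** 2) / delta ** 2, math.sqrt(k * log_n))

def balanced_delta(k, n):
    """delta = k^(-1/3) * (log n)^(2/3), the choice made in Lemma 4.1."""
    return k ** (-1.0 / 3.0) * math.log(n) ** (2.0 / 3.0)

if __name__ == "__main__":
    k, n = 10**4, 10**6
    delta = balanced_delta(k, n)
    first, second, third = regret_terms(k, n, delta)
    # first and second coincide with (k log n)^(2/3); third is of lower order.
    print(delta, first, second, third, (k * math.log(n)) ** (2.0 / 3.0))
```

Since $\sqrt{k \log n} \le (k \log n)^{2/3}$ whenever $k \log n \ge 1$, the third summand never dominates.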

4.3 Concentration inequalities and the proof of Claim 4.2

We use an elementary concentration inequality known as Chernoff Bounds, in a formulation from [34].

Theorem 4.7 (Chernoff Bounds). Consider $n$ i.i.d. random variables $X_1, \ldots, X_n$ with values in $[0, 1]$. Let
$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ be their average, and let $\mu = E[\bar{X}]$. Then:

(a) $\Pr[\,|\bar{X} - \mu| > \delta\mu\,] < 2\,e^{-\mu n \delta^2/3}$ for any $\delta \in (0, 1)$.

(b) $\Pr[\bar{X} > a] < 2^{-an}$ for any $a > 6\mu$.

Further, we use a non-standard corollary from [29] which provides us with a sharper (i.e., smaller)
confidence radius when $\mu$ is small; we include the proof for the sake of completeness.

Theorem 4.8 ([29]). Consider $n$ i.i.d. random variables $X_1, \ldots, X_n$ on $[0, 1]$. Let $\bar{X}$ be their average, and
let $\mu = E[\bar{X}]$. Then for any $\alpha > 0$, letting $r(\alpha, x) = \frac{\alpha}{n} + \sqrt{\frac{\alpha x}{n}}$, we have:

\[ \Pr\bigl[\,|\bar{X} - \mu| < r(\alpha, \bar{X}) < 3\,r(\alpha, \mu)\,\bigr] > 1 - e^{-\Omega(\alpha)}. \]

(This is Lemma 4.9 in the full (arXiv) version of [29].)

Proof. First, suppose $\mu \ge \frac{\alpha}{6n}$. Apply Theorem 4.7(a) with $\delta = \frac{1}{2}\sqrt{\frac{\alpha}{6 \mu n}}$. Thus with probability at least
$1 - e^{-\Omega(\alpha)}$ we have $|\bar{X} - \mu| < \delta\mu \le \mu/2$. Plugging in the $\delta$,

\[ |\bar{X} - \mu| < \tfrac{1}{2}\sqrt{\tfrac{\alpha\mu}{n}} \le \sqrt{\tfrac{\alpha \bar{X}}{n}} \le r(\alpha, \bar{X}) < 1.5\,r(\alpha, \mu). \]

Now suppose $\mu < \frac{\alpha}{6n}$. Then using Theorem 4.7(b) with $a = \frac{\alpha}{n}$, we obtain that with probability at least
$1 - 2^{-\Omega(\alpha)}$ we have $\bar{X} < \frac{\alpha}{n}$, and therefore

\[ |\bar{X} - \mu| < \tfrac{\alpha}{n} < r(\alpha, \bar{X}) < (1 + \sqrt{2})\,\tfrac{\alpha}{n} < 3\,r(\alpha, \mu). \quad □ \]

Proof of (7) in Claim 4.2. For each price $p \in \mathcal{P}$ let $\{Z_{i,p}\}_{i \le n}$ be a family of independent 0-1 random
variables with expectation $S(p)$. Without loss of generality, let us pretend that the $i$-th time that price $p$ is
selected by the pricing strategy, a sale happens if and only if $Z_{i,p} = 1$. Then by Theorem 4.8, after the $i$-th play
of price $p$ the bound (7) holds with probability at least $1 - n^{-4}$. Taking the Union Bound over all choices of
$i$ and all choices of $p$, we obtain that (7) holds with probability at least $1 - n^{-2}$ as long as $|\mathcal{P}| \le n$ (which
is the case for us). □

Sharper Azuma-Hoeffding inequality. We use a concentration inequality on the sum of $n$ random variables
$X_t \in \{0, 1\}$ such that each variable $X_t$ is a random coin toss with probability $M_t$ that depends on the
previous variables $X_1, \ldots, X_{t-1}$. We are interested in bounding the deviation $|X - M|$, where $X = \sum_t X_t$
and $M = \sum_t M_t$. The well-known Azuma-Hoeffding inequality states that with high probability we have
$|X - M| \le O(\sqrt{n \log n})$. However, we need a sharper high-probability bound: $|X - M| \le O(\sqrt{M \log n})$.
Moreover, we need an extension of such a bound which considers the deviation $|\sum_{t=1}^n \alpha_t (X_t - M_t)|$, where each
multiplier $\alpha_t \in [0, 1]$ is determined by $X_1, \ldots, X_{t-1}$.

We use the following concentration inequality from the literature.

Theorem 4.9 (Theorem 3.15 in [33]). Let $Z_1, \ldots, Z_n$ be random variables which take values in $[-1, 1]$. Let
$Z = \sum_{t=1}^n Z_t$, $\mu = E[Z]$. Let $V = \sum_{t=1}^n \mathrm{Var}(Z_t \mid Z_1, \ldots, Z_{t-1})$. Then for any $a > 0$, $v > 0$ we have

\[ \Pr\bigl[(|Z - \mu| \ge a) \wedge (V \le v)\bigr] \le e^{\,\ldots} \]

We use the above bound to bound the deviation for $|\sum_{t=1}^n \alpha_t (X_t - M_t)|$.

Theorem 4.10. Let $X_1, \ldots, X_n$ be 0-1 random variables. Let $M_t = E[X_t \mid X_1, \ldots, X_{t-1}]$ and $M = \sum_{t=1}^n M_t$. For each
$t$, let $\alpha_t \in [0, 1]$ be the multiplier determined by $X_1, \ldots, X_{t-1}$. Then for any $b \ge 1$ the event

\[ \Bigl|\sum_{t=1}^n \alpha_t (X_t - M_t)\Bigr| \le b\,\bigl(\sqrt{M \log n} + \log n\bigr) \]

holds with probability at least $1 - n^{-\Omega(b)}$.

Proof. Let $Z_t = X_t - y_t$, where $y_t \in [0, 1]$ is a function of $X_1, \ldots, X_{t-1}$, and let $Z = \sum_{t=1}^n \alpha_t Z_t$.
We claim that

\[ \Pr\Bigl[\,\Bigl|\sum_{t=1}^n \alpha_t (Z_t - E[Z_t])\Bigr| \le b\,\bigl(\sqrt{M \log n} + \log n\bigr)\Bigr] \ge 1 - n^{-\Omega(b)}, \quad \text{for any } b \ge 1. \tag{18} \]

To prove (18), let $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$ be the $\sigma$-algebra generated by $X_1, \ldots, X_t$. Then conditional on
$\mathcal{F}_{t-1}$, the summand $\alpha_t Z_t$ is a random variable with expectation $\alpha_t (M_t - y_t)$ and
two possible values, $-\alpha_t y_t$ and $\alpha_t (1 - y_t)$, where $\alpha_t$ and $y_t$ are constants. It follows that
$\mathrm{Var}(\alpha_t Z_t \mid \mathcal{F}_{t-1}) = \alpha_t^2 (M_t - M_t^2) \le M_t$, and therefore $V = \sum_{t=1}^n \mathrm{Var}(\alpha_t Z_t \mid \mathcal{F}_{t-1}) \le M$.

Taking Theorem 4.9 with $a = b\,(\sqrt{v \log n} + \log n)$, we have that for any $b \ge 1$ the event

\[ \bigl(|Z - E[Z]| > b\,(\sqrt{v \log n} + \log n)\bigr) \wedge (V \le v) \]

holds with probability at most $n^{-\Omega(b)}$. Finally, we take the Union Bound over (say) all integer $v$ between
$\log n$ and $n$, noting that $V \le M$. This completes the proof of (18).

Finally, to prove the theorem take (18) with $y_t = M_t$ and note that $Z_t = X_t - M_t$ and so $E[Z_t] = 0$. □

Proof of (8) and (9) in Claim 4.2. Recall that for each $t$, $X_t$ is a 0-1 random variable with expectation
$S(p_t)$, where $p_t$ depends on $X_1, \ldots, X_{t-1}$. Using Theorem 4.10 with $\alpha_t = 1$ we obtain (8). Using
Theorem 4.10 with $\alpha_t = p_t$ we obtain (9). □
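As a sanity check on the confidence radius $r(\alpha, x) = \alpha/n + \sqrt{\alpha x/n}$ from Theorem 4.8, the following Monte Carlo sketch estimates how often the event $|\bar{X} - \mu| < r(\alpha, \bar{X}) < 3\,r(\alpha, \mu)$ occurs for Bernoulli samples with a small mean. The sample sizes, the value of $\alpha$, and the function names are arbitrary illustrative choices, not taken from the paper.

```python
import math
import random

def radius(alpha: float, x: float, n: int) -> float:
    """r(alpha, x) = alpha/n + sqrt(alpha * x / n), as in Theorem 4.8."""
    return alpha / n + math.sqrt(alpha * x / n)

def coverage(mu: float, n: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which |mean - mu| < r(alpha, mean) < 3 r(alpha, mu)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Empirical mean of n Bernoulli(mu) draws.
        mean = sum(rng.random() < mu for _ in range(n)) / n
        r_emp = radius(alpha, mean, n)
        if abs(mean - mu) < r_emp < 3 * radius(alpha, mu, n):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(coverage(mu=0.01, n=1000, alpha=5.0, trials=300))
```

With $\mu = 0.01$ and $n = 1000$, the radius is on the order of $\sqrt{\alpha \mu / n}$ rather than $\sqrt{\alpha / n}$, which is the improvement the analysis relies on for prices with low survival rate.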



5 The $O(\sqrt{k} \log n)$ regret bound (Theorem 1.3)

We show that the pricing strategy from Section 4 (with a different parameter) satisfies an improved regret
bound, $O(\sqrt{k} \log n)$, if the demand distribution is regular and moreover the ratio $\frac{k}{n}$ is sufficiently small. The
regret bound depends on a distribution-specific constant.

Theorem 5.1. For any regular demand distribution $F$ there exist positive constants $s_F$ and $c_F$ such that
CappedUCB with parameter $\delta = k^{-1/2} \log n$ achieves regret $O(c_F \sqrt{k} \log n)$ whenever $\frac{k}{n} \le s_F$. For
monotone hazard rate distributions we can take $s_F = \frac{1}{4}$.

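Proof. Let $g(s) = s\,S^{-1}(s)$ be a function from $[S(1), 1]$ to $[0, 1]$ that maps a survival rate to the corresponding
revenue. Regularity implies $g''(\cdot) \le 0$. Since $g'(0) > 0$, we can pick a constant $s_F > 0$ such that
$C = g'(s_F) > 0$. For monotone hazard rate distributions we can take $s_F = \frac{1}{4}$ because for any maximizer
$s$ of $g(\cdot)$ it holds that $s \ge \frac{1}{e}$ (see Claim B.2). Now, for any $\frac{k}{n} \le s_F$ we have that $g'(\frac{k}{n}) \ge C$. We will
use this to obtain a lower bound on $\Delta(p)$; any such lower bound is absent in the analysis in Section 4. This
improvement results in savings in (16), which in turn implies the claimed regret bound.

We will use the notation from Section 4.2, particularly the "badness" $\Delta(p)$ and the set $\mathcal{P}_\epsilon$ of arms of
badness $\ge \epsilon$ that have been selected at least once. Note that by regularity $g'(s) \ge C$ for any $s \in (0, \frac{k}{n}]$. Let
$p^* = S^{-1}(\frac{k}{n})$ and $p \in \mathcal{P}_\epsilon$. By the third line in (13) it holds that $S(p) < \frac{k}{n}$ and then $p > p^*$.

First, we claim that $S(p) < \frac{k}{n}\,\frac{p^*}{p}$. Indeed, this is because $p\,S(p) = g(S(p)) < g(\frac{k}{n}) = p^*\,\frac{k}{n}$.

Second, we bound $\Delta(p)$ from below:

\[ \tfrac{1}{n}\,\nu^*_{\mathrm{act}} \ge (1 - \delta)\,\tfrac{1}{n}\,\nu^* \ge (1 - \delta)\,g(\tfrac{k}{n}) \]
\[ \begin{aligned} \Delta(p) &\ge (1 - \delta)\,g(\tfrac{k}{n}) - g(S(p)) \\ &\ge \bigl[g(\tfrac{k}{n}) - g(S(p))\bigr] - \delta\,g(\tfrac{k}{n}) \\ &\ge C\,\bigl(\tfrac{k}{n} - S(p)\bigr) - \delta\,\tfrac{k}{n}\,p^*. \end{aligned} \]

Since $\mathcal{P}$ is given by (5), it holds that $\mathcal{P}_\epsilon \subset \{p^* a\,(1 + \delta)^i : i \in \mathbb{N}\}$ for some $a \ge 1$. Define

\[ \mathcal{P}' = \{p \in \mathcal{P}_\epsilon : p = p^* a\,(1 + \delta)^i \text{ with } i \ge \ldots\}. \]

Then for any $p \in \mathcal{P}'$ it holds that $p/p^* = a\,(1 + \delta)^i \ge 1 + i\delta$ and therefore

\[ \Delta(p) \ge C\,\tfrac{k}{n}\Bigl(1 - \frac{1}{1 + i\delta}\Bigr) - \delta\,\tfrac{k}{n}\,p^* \ge \frac{C}{2}\,\frac{k}{n}\,\frac{i\delta}{1 + i\delta}. \]

Therefore, noting that $|\mathcal{P}'| \le |\mathcal{P}| \le \frac{1}{\delta} \log n$, we have

\[ \frac{k}{n} \sum_{p \in \mathcal{P}'} \frac{1}{\Delta(p)} \le \frac{2}{C} \sum_{p \in \mathcal{P}'} \Bigl(1 + \frac{1}{i\delta}\Bigr) \le \frac{2}{C} \Bigl(|\mathcal{P}'| + \frac{1}{\delta} \log |\mathcal{P}'|\Bigr) \le O\Bigl(\frac{1}{C\delta} \log n\Bigr). \]

Plugging this into (16) with $\epsilon = \delta\,\frac{k}{n}$, we obtain:

\[ \frac{k}{n} \sum_{p \in \mathcal{P}_\epsilon} \frac{1}{\Delta(p)} \le O\Bigl(\frac{1}{\delta} \log n\Bigr)\Bigl(1 + \frac{1}{C}\Bigr), \]
\[ \mathrm{Regret} \le O\Bigl(\delta k + \frac{1}{\delta}\Bigl(1 + \frac{1}{C}\Bigr)(\log n)^2 + \sqrt{k \log n}\Bigr), \tag{19} \]

where the distribution-specific constant is $c_F = 1 + 1/C$.

The regret bound (19) improves over the corresponding bound (17) in Section 4. We obtain the final bound
by plugging $\delta = k^{-1/2} \log n$. □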

It is desirable to achieve the bounds in Theorem 1.2 and Theorem 5.1 using the same pricing strategy.
Unfortunately, the choice of parameter $\delta$ in Theorem 5.1 results in a trivial $O(k)$ regret guarantee for arbitrary
demand distributions (as per Equation (17)). However, varying $\delta$ and using Equations (17) and (19)
we obtain a family of pricing strategies that improve over the bound in Theorem 1.2 for the "nice" setting in
Theorem 5.1 and moreover have non-trivial regret bounds for arbitrary demand distributions.

Theorem 5.2. For each $\gamma \in [\frac{1}{3}, \frac{1}{2}]$, consider pricing strategy CappedUCB with parameter $\delta = O(k^{-\gamma})$. This
pricing strategy achieves regret $O(k^{1-\gamma})\,(1 + 1/g'(\frac{k}{n}))$ if the demand distribution is regular and $g'(\frac{k}{n}) > 0$,
and regret $O(k^{2\gamma})$ for arbitrary demand distributions.



6 Lower Bounds 

We prove two lower bounds on regret over all demand distributions which match the upper bounds in
Theorem 1.2 and Theorem 1.3, respectively. (Note that the latter upper bound is specific to regular distributions.)
Throughout this section, regret is with respect to the fixed-price benchmark.

Theorem 6.1. Consider the dynamic pricing problem with limited supply: with $n$ agents and $k \le n$ items.

(a) No detail-free pricing strategy can achieve regret $o(k^{2/3})$ for arbitrarily large $k, n$.

(b) For any $\gamma < \frac{1}{2}$, no detail-free pricing strategy can achieve regret $O(c_F\,k^{\gamma})$ for all demand
distributions $F$ and arbitrarily large $k, n$, where the constant $c_F$ can depend on $F$.

Our proof is a black-box reduction to the unlimited supply case ($k = n$). The unlimited supply case of
Theorem 6.1 is proved in [30] (see Footnote 7).

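Proof. Suppose that some pricing strategy $\mathcal{A}$ violates part (a). Then there is a sequence $\{k_i, n_i\}_{i \in \mathbb{N}}$, where
$k_i \le n_i$ and $\{k_i\}_{i \in \mathbb{N}}$ is strictly increasing, such that $\mathcal{A}$ achieves regret $o(k_i^{2/3})$ for all problem instances with
$n_i$ agents and $k_i$ items, for each $i \in \mathbb{N}$. To obtain a contradiction, let us use $\mathcal{A}$ to solve the unlimited supply
problem with regret $o(n^{2/3})$. Specifically, we will solve problem instances with $k_i/4$ agents, for each $i$.

Fix $i \in \mathbb{N}$ and let $k = k_i$ and $n = n_i$. Consider a problem instance $\mathcal{I}$ with unlimited supply and $k/4$
agents and survival rate $S(\cdot)$. Let $\mathcal{I}'$ be an artificial problem instance with unlimited supply and $n$ agents,
so that the first $k/4$ agents in $\mathcal{I}'$ correspond to $\mathcal{I}$. Form an artificial problem instance $\mathcal{J}$ with $k$ items and $n$
agents as follows: in each round, $\mathcal{A}$ outputs a price, then with probability $k/2n$ this price is offered to the
next agent in $\mathcal{I}'$, and with the remaining probability there is no interaction with agents in $\mathcal{I}'$ and no sale.
Since the demand distribution for $\mathcal{J}$ is a mixture of the "no sale" event which happens with probability
$1 - \frac{k}{2n}$ and the original demand distribution for $\mathcal{I}$, the survival rate for $\mathcal{J}$ is given by $S_{\mathcal{J}}(p) = \frac{k}{2n}\,S(p)$.

Running $\mathcal{A}$ on problem instance $\mathcal{J}$ induces a pricing strategy $\mathcal{A}'$ on the original problem instance $\mathcal{I}$.
In the rest of the proof we show that $\mathcal{A}'$ achieves regret $o(k^{2/3})$ on $\mathcal{I}$.

Let $\mathrm{Rev}_{\mathcal{J}}(\mathcal{A})$ and $\widehat{\mathrm{Rev}}_{\mathcal{J}}(\mathcal{A})$ be, respectively, the expected revenue and the realized revenue of $\mathcal{A}$ on
problem instance $\mathcal{J}$. Let $r = \arg\max_p p\,S(p)$ be the Myerson reserve price, and let $\mathcal{A}_r$ be the fixed-price
strategy with price $r$. By our assumption, we have that $\mathrm{Rev}_{\mathcal{J}}(\mathcal{A}) \ge \mathrm{Rev}_{\mathcal{J}}(\mathcal{A}_r) - o(k^{2/3})$. We need to
deduce that $\mathrm{Rev}_{\mathcal{I}}(\mathcal{A}') \ge \mathrm{Rev}_{\mathcal{I}}(\mathcal{A}_r) - o(k^{2/3})$.

Let $N$ be the number of rounds in $\mathcal{J}$ in which $\mathcal{A}$ interacts with the agents in $\mathcal{I}'$. With high probability
$\frac{k}{4} \le N \le k$. Let us condition on $N$ and the event $\mathcal{E}_N = \{k/4 \le N \le k\}$:

\[ E\bigl[\widehat{\mathrm{Rev}}_{\mathcal{J}}(\mathcal{A}_r) \mid N, \mathcal{E}_N\bigr] = N\,r\,S(r), \]
\[ E\bigl[\widehat{\mathrm{Rev}}_{\mathcal{J}}(\mathcal{A}) - \widehat{\mathrm{Rev}}_{\mathcal{I}}(\mathcal{A}') \mid N, \mathcal{E}_N\bigr] \le \bigl(N - \tfrac{k}{4}\bigr)\,r\,S(r). \]

Since $E[N] = \frac{k}{2}$, it follows that

\[ \begin{aligned} \mathrm{Rev}_{\mathcal{I}}(\mathcal{A}') &\ge \mathrm{Rev}_{\mathcal{J}}(\mathcal{A}) - \tfrac{k}{4}\,r\,S(r) - o(1) \\ &\ge \mathrm{Rev}_{\mathcal{J}}(\mathcal{A}_r) - \tfrac{k}{4}\,r\,S(r) - o(k^{2/3}) \\ &= \tfrac{k}{4}\,r\,S(r) - o(k^{2/3}) \\ &= \mathrm{Rev}_{\mathcal{I}}(\mathcal{A}_r) - o(k^{2/3}), \end{aligned} \]

as required. The reduction for part (b) proceeds similarly. □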
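The thinning step of the reduction can be illustrated with a short simulation. The sketch below builds the survival rate of the artificial instance and counts the number of rounds $N$ in which the posted price actually reaches a real agent; the function names and the concrete values of $k$ and $n$ are hypothetical and chosen only for illustration.

```python
import random

def thinned_survival(S, k, n):
    """Survival rate of the artificial instance: S_J(p) = (k / 2n) * S(p)."""
    return lambda p: (k / (2 * n)) * S(p)

def count_interactions(k, n, rng):
    """Number of rounds N (out of n) in which the price is offered to a real agent."""
    return sum(rng.random() < k / (2 * n) for _ in range(n))

if __name__ == "__main__":
    k, n = 200, 2000
    rng = random.Random(1)
    counts = [count_interactions(k, n, rng) for _ in range(200)]
    # The average should be close to E[N] = k/2, and N should fall in [k/4, k].
    print(sum(counts) / len(counts), min(counts), max(counts))
```

Because $N$ is a sum of $n$ independent coin tosses with mean $k/2$, it concentrates around $k/2$, which is what makes the event $\{k/4 \le N \le k\}$ hold with high probability.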



7 Selling very few items: proof of Theorem 1.5



In this section we target a case when very few items are available for sale (roughly, $k < O(\log^2 n)$), so that
the bound in Theorem 1.1 becomes trivial. We provide a different pricing strategy whose regret does not
depend on $n$, under the mild assumption of monotone hazard rate.

We rely on the characterization in Claim 3.2: we look for the price $p^* = \max(p_r, S^{-1}(\frac{k}{n}))$, where $p_r =
\arg\max_p p\,S(p)$ is the Myerson reserve price. The pricing strategy proceeds as follows (see Mechanism 2).
It considers prices $p_\ell = (1 + \delta)^{-\ell}$, $\ell \in \mathbb{N}$, sequentially in the descending order. For each $\ell$, it offers
the price $p_\ell$ to a fixed number of agents. The loop stops once the pricing strategy detects that, essentially,
the "best" $p_\ell$ has been reached: either $S(p_\ell)$ is close to $\frac{k}{n}$, or we are near a maximum of $p\,S(p)$. Parameters
are chosen so as to minimize regret.

Theorem 7.1. For some parameters $\epsilon$ and $\delta$, Mechanism 2 achieves regret $O(k^{3/4}\,\mathrm{poly}\log(k))$ with respect
to the offline benchmark, for any demand distribution that satisfies the monotone hazard rate condition.

The rest of this section is devoted to proving Theorem |7.1| for parameters e = k~^/^ and 5 = log k)^/^. 
We will assume that the demand distribution is MHR, without further notice. We derive Theorem 7.1 from
the following multiplicative bound; it appears difficult to prove the additive version directly.



Lemma 7.2. Assume p* > e. Set ^ = \J \ log k log ^ log log ^. Then the expected revenue of Mechanism^ 
is at least 1 — 0{5) fraction of the offline benchmark. 

Proof of Theorem 7.1. If $p^* < \epsilon$ then the expected loss in revenue is at most $\epsilon k$. Else by Lemma 7.2 the
expected loss in revenue is at most $O(\delta k)$, where $\delta$ is from Lemma 7.2. In both cases the additive regret
compared to the offline benchmark is at most $\max(\epsilon k, O(k\delta))$. Finally, pick $\epsilon = k^{-1/4}$. □

Footnote: If $\mathcal{A}$ stops before it iterates through all agents in $\mathcal{I}$, the remaining agents in $\mathcal{I}$ are offered a price of $\infty$.

Mechanism 2 Descending prices



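Parameter: Approximation parameters $\delta, \epsilon \in [0, 1]$

1: Let $\alpha = (\frac{k}{n})^{1-\delta}$, $\gamma = \min(\alpha, 1/e)$.

2: $\ell \leftarrow 0$, $\ell_{\max} \leftarrow 0$, $R_{\max} \leftarrow 0$.

3: repeat

4:   $\ell \leftarrow \ell + 1$, $p_\ell \leftarrow (1 + \delta)^{-\ell}$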

5: Offer price to m = [(^ iog^^"(i/e) 1 agents. 

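6:   Let $S_\ell$ be the fraction of them who accept.

7:   Let $R_\ell = p_\ell\,S_\ell$ be the average per agent revenue.

8:   If $S_\ell \ge (1 + \delta)^{-1}\gamma$ and $R_\ell > R_{\max}$,

9:   then $\ell_{\max} \leftarrow \ell$, $R_{\max} \leftarrow R_\ell$.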

10: until < e or 5^ > (1 + 6)a or Ri < {I + S)~^R 

11: Offer price $p = p_\ell$ so long as unsold items remain.



7.1 Proof of Lemma 7.2

We use a multiplicative bound in which fixed-price strategies for limited supply are compared to those for
unlimited supply (which in turn can be compared to the offline benchmark using Claim A.2).

Lemma 7.3. Assume the demand distribution is regular. Letp' < p be two prices such that p > S^^{k/n). 
Letn' < n. Then Rev(^^'(p')) > ~ U - Rev(^;^(p)). 

The proof uses a technique from [40]; see Appendix A. Also, we take advantage of several properties of
MHR distributions, detailed in Appendix B.

We say the exploration phase is $\delta$-approximate if

\[ S(p_\ell) \ge \gamma \ \Rightarrow\ 1 - \delta \le S_\ell / S(p_\ell) \le 1 + \delta. \]

Claim 7.4. The exploration phase is $\delta$-approximate with probability at least $1 - 2\,\bigl(\log_{1+\delta} \frac{1}{\epsilon}\bigr)\,e^{-\delta^2 \gamma m/3}$.

Proof. This follows directly by applying Chernoff bounds (both the upper and lower tail form) to the event
that some $S_\ell$ violates the condition, then applying the union bound over all choices of $\ell$. □

Claim 7.5. When the exploration phase is $\delta$-approximate, we have $(1 - 7\delta)\,S^{-1}(\frac{k}{n}) \le p \le p^*$.

Proof. It is easy to see that none of the stopping conditions of the exploration phase can be triggered until
the price goes below $p^*$. Therefore $p \le p^*$. For the other inequality observe that, by Claim B.3, it holds that
$S^{-1}(\alpha) \ge (1 - \delta)\,S^{-1}(\frac{k}{n})$. Therefore it suffices to show that $p \ge (1 - 6\delta)\,S^{-1}(\alpha)$.

Assume for a contradiction that the stopping conditions are not triggered in some phase £ such that 
Pi+i < (1 + 6)~^ S~^{a). Therefore, at round £ we have 

Pi = {1 + 6)pi+i < {I + 6)-'^ S~\a) (20) 

Examining the stopping conditions, and using our assumption above, we deduce that: 

5^ < (1 + 6)a (21) 
Rm../Re<{l + S)\ (22) 
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Combining (l20l ) and (|2TI ). we get 

= p^Se < (1 + (5)~^q5"^ (a) (23) 

Note that, since we chose round £ such that pi <C 5~^(a), the pricing strategy already encountered some 
round t < £ such that pt is"close" to ^"^(q) - in particular 

{l + 6r'S-\a)<pt<S~\a) (24) 

and therefore also S{pt) > a. Since we assume the exploration phase is 5-approximate, the estimated 
survival rate at round t satisfies > (1 + 6)~^S{pt) > (1 + S)~^a. Combining this with (|24l) . we get that 
the estimated revenue Rt at round t satisfies 

Rt = PtSt > (1 + 6y^aS"\a) (25) 

The value of i?max in round £ is at least Rt. Combining (1251 ) with (1231 ). this shows that at round £ we 
have > (1 + 5)^, contradicting ([22]). □ 

Claim 7.6. When the exploration phase is $\delta$-approximate, we have $R(p) \ge (1 - 7\delta)\,R(p^*)$.

Proof. By Claim 1731 we are done when p* = (^). Therefore, assume p* = pi-, the Myerson reserve 
price. It is easy to see that R{p£^i) > R{pe) for each £. Let t be the first integer such that pt < p* = Pr. 
Note that {I + 5)~^p* < pt < p*. Claim |73] says that p < p* = p^, therefore £ > t and by Claim EH 
S{pt) ^ S{p^) > 1/e > 7- It suffices to show that a stopping condition must be triggered before R{pe) gets 
too small. 

Assume for a contradiction that the stopping condition is not triggered by phase £ > t, for some £ 
such that R{p£^i) < (1 — 76)R{p*). Since R decreases slowly as described above, it follows that t < £. 
Moreover, since we assumed the exploration phase is J-approximate, St > S{pt) > 7. Therefore, 
during phase £ we have Rmax ^ Rt = StPt > {j^Y R{P*)- Since no stopping condition is triggered 
for phase £, it must be that R^ > {j^f Rmax > (j^)^ R{p*). Moreover R{pe+i) > R{pi>) > 
Re > {j^fR{p*), a contradiction. □ 

We can now complete the proof of Lemma 7.2.

We condition on the exploration phase being $\delta$-approximate. Let $n'$ and $k'$ be the number of players and
items left after the exploration phase, respectively. In the exploitation phase, we attain expected revenue
$\mathrm{Rev}(\mathcal{A}_{k'}^{n'}(p))$. Moreover, in the exploration phase we attained revenue at least $(k - k')\,p$, since we only used
prices greater than or equal to $p$. Therefore, the total expected revenue of our pricing strategy is at least
$\mathrm{Rev}(\mathcal{A}_{k'}^{n'}(p)) + (k - k')\,p$. It is easy to see that this is at least $\mathrm{Rev}(\mathcal{A}_{k}^{n'}(p))$.

It remains to bound the expected revenue of A^' (p). Observe that — > 1 — 5. For brevity, denote 

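There are two cases. In the first case, $p^* = S^{-1}(\frac{k}{n})$. Lemma 7.3 and Claim 7.5 imply that

\[ \mathrm{Rev}(\mathcal{A}_{k}^{n'}(p)) \ge \beta\,\frac{n'}{n}\,\frac{p}{p^*}\,\mathrm{Rev}(\mathcal{A}_n^n(p^*)) \ge \beta\,(1 - 8\delta)\,\mathrm{Rev}(\mathcal{A}_n^n(p^*)). \]

The second case is $p^* = p_r$. By Claim 7.6 and unimodality of $R$, we have that

\[ \mathrm{Rev}\bigl(\mathcal{A}_n^n(\max(S^{-1}(\tfrac{k}{n}), p))\bigr) \ge \mathrm{Rev}(\mathcal{A}_n^n(p)) \ge (1 - 7\delta)\,\mathrm{Rev}(\mathcal{A}_n^n(p^*)). \]

Moreover, using Lemma 7.3, Claim 7.5 and the equation above we show that

\[ \mathrm{Rev}(\mathcal{A}_{k}^{n'}(p)) \ge \beta\,(1 - 8\delta)\,\mathrm{Rev}\bigl(\mathcal{A}_n^n(\max(S^{-1}(\tfrac{k}{n}), p))\bigr) \ge \beta\,(1 - 15\delta)\,\mathrm{Rev}(\mathcal{A}_n^n(p^*)). \]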

By Lemma IA.2I Mechanism |2] achieves, in expectation, at least the following fraction of the expected 
revenue of the offline benchmark: 

13(1- 0{6)) (1 - 2 logi+5(7) exp(-i ^'^^)) ■ 

Now, plug 6 into Lemma [731 and m as defined in the pricing strategy. Note that m = Qij^^jj^)- We obtain 
the final bound replacing 7 by the lesser quantity ^, and using the fact that \ogi_^_s{x) = 6(| logx). 

8 Conclusions and open questions 

We consider dynamic pricing with limited supply and achieve near-optimal performance using an index-based
bandit-style algorithm. A key idea in designing this algorithm is that we define the index of an arm
(price) according to the estimated expected total payoff from this arm given the known constraints.

It is worth noting that a good index-based algorithm did not have to exist in our setting. Indeed, many 
bandit algorithms in the literature are not index-based, e.g. EXP3 Ol and "zooming algorithm" |[29l and their 
respective variants. The fact that Gittins algorithm ll24l and UCBl H achieve (near-)optimal performance 
with index-based algorithms was widely seen as an impressive contribution. 

While in this paper we apply the above key idea to a specific index-based algorithm (UCB1), it can be
seen as an (informal) general reduction for index-based algorithms for dynamic pricing, from unlimited 
supply to limited supply. This reduction may help with more general dynamic pricing settings (more on 
that below), and moreover it can be extended to other bandit-style settings where the "best arm" is not an 
arm with the best expected per-round payoff. In particular, an ongoing project ||T1 uses this reduction in the 
context of adaptive crowd-selection in crowdsourcing. 

It is an interesting open question whether a reduction such as above can be made more formal, and which 
algorithms and which settings it can be applied to. An ambitious conjecture for our setting is that there is
a simple black-box reduction from unlimited supply to limited supply that applies to arbitrary "reasonable" 
algorithms. In the full generality this conjecture appears problematic; in particular, some reasonable bandit 
algorithms such as EXP3 are hard-coded to spend a prohibitively large amount of time on exploration. 

This paper gives rise to a number of more concrete open questions. The most immediate ones concern 
extending our upper and lower bounds for, respectively, more general and more specific classes of demand 
functions. First, it is desirable to extend Theorem 1.1 to possibly irregular distributions, i.e. obtain non-trivial
regret bounds with respect to the offline benchmark. Second, one wonders whether the optimal
$O(c_F \sqrt{k})$ regret rate from Theorem 1.3 can be extended to all regular demand distributions. Third, it is
open whether our lower bounds can be strengthened to regular demand distributions.

Further, it is desirable to extend dynamic pricing with limited supply beyond IID valuations. A recent 
result in this direction is lITSl . where the demand distribution can change exactly once, at some point in time 
that is unknown to the mechanism. Some of the natural specific targets for further work are slowly changing 
valuations and adversarial valuations. One promising approach for slowly changing valuations is to apply 
the reduction from this paper to index-based algorithms for the corresponding bandit setting [38, 37].

Acknowledgements. We are grateful to Jason Hartline, Qiqi Yan and Assaf Zeevi for their comments.

Appendix A: Benchmark comparison

We start with a self-contained proof of a slightly weaker version of Lemma 3.1 (which suffices for the
purposes of this paper).

Lemma A.l (Yan 1401 ). For each regular demand distribution there exists a fixed-price strategy whose 
expected revenue is at least the offline benchmark minus 0{yjk log k). 

Recall that A^{p) denotes the fixed-price strategy with k items, n agents, and fixed price p. Let de- 
note the optimal (expected revenue maximizing) offline auction with n-players and fc-items. As in Claim [l!2l 
let p* = max(pi-, S~^{^)), where p^ = argmax^p 5'(p) is the Myerson reserve price. 

Claim A.2. If the demand distribution is regular then Rev(^"(p*)) > Rev(M^). 

Proof of Claim lA2l Let qi be the probability that sells to agent i. By symmetry, qi = qj for all players i 
and j, so we simply denote this probability by q. help = S^^{q) be the single price we would need to offer 
a agent in order to sell to him with probability q. Since i? is a concave function of the selling probability, 
Jensen's inequality implies that R{p) is an upper bound on the revenue collected by the Myerson auction 
from a single agent. Equivalently: nR{p) > Rev(M^). 

Now, observe that the expected number of items sold by $\mathcal{M}_n^k$ is $nq$. Since $\mathcal{M}_n^k$ never sells more than $k$
items, it must be that $q \le k/n$. Therefore, $p \ge S^{-1}(k/n)$. By definition of $p^*$, we deduce that there are two
cases: (1) $p^* = p_r$, or (2) $p_r < p^* = S^{-1}(k/n) \le p$. In case (1) it is clear that $R(p^*) \ge R(p)$. In case (2) we
get that $R(p^*) \ge R(p)$ since $R(x)$ is decreasing for $x \ge p_r$. Then

$$\mathrm{Rev}(\mathcal{A}_n^\infty(p^*)) = nR(p^*) \ge nR(p) \ge \mathrm{Rev}(\mathcal{M}_n^k). \qquad \square$$

Lemma A.1 follows from Claim A.2 and Claim 3.2 because for $p = p^*$ we have $S(p) \le k/n$, and so

$$\nu(p) = p \min(k, nS(p)) = n p^* S(p^*) = \mathrm{Rev}(\mathcal{A}_n^\infty(p^*)) \ge \mathrm{Rev}(\mathcal{M}_n^k).$$
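As a small illustration of the quantities in this derivation, the following sketch evaluates $p^* = \max(p_r, S^{-1}(k/n))$ and $p \min(k, nS(p))$ for one concrete case. The uniform demand distribution on $[0, 1]$, the values of $k$ and $n$, and all function names are assumptions chosen for the example; they do not come from the paper.

```python
def survival_uniform(p):
    """S(p) = Pr[value >= p] for values uniform on [0, 1] (illustrative assumption)."""
    return min(1.0, max(0.0, 1.0 - p))


def nu(p, k, n, survival=survival_uniform):
    """The quantity p * min(k, n * S(p)) used in the derivation above."""
    return p * min(k, n * survival(p))


def p_star_uniform(k, n):
    """p* = max(p_r, S^{-1}(k/n)); for uniform demand p_r = 1/2 and S^{-1}(q) = 1 - q."""
    return max(0.5, 1.0 - k / n)


k, n = 10, 100                      # hypothetical supply and number of agents
p_star = p_star_uniform(k, n)

# At p*, S(p*) <= k/n, so the minimum is attained by n * S(p*).
unlimited_supply_revenue = n * p_star * survival_uniform(p_star)

# For this example, p* also maximizes p * min(k, n * S(p)) over a fine price grid.
grid = [i / 1000 for i in range(1001)]
best_on_grid = max(nu(p, k, n) for p in grid)

print(p_star, nu(p_star, k, n), unlimited_supply_revenue, best_on_grid)
```

With scarce supply ($k/n < 1/2$ here) the supply constraint, not the Myerson reserve, determines $p^*$.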

Multiplicative bounds. Further, we derive a multiplicative bound in which fixed-price strategies for limited
supply are compared to those for unlimited supply. We use this bound to prove Lemma 7.3.

Claim A.3. For any regular demand distribution and any $p \ge S^{-1}(k/n)$ it holds that

$$\mathrm{Rev}(\mathcal{A}_n^k(p)) \ge \left(1 - \frac{1}{\sqrt{2\pi k}}\right) \mathrm{Rev}(\mathcal{A}_n^\infty(p)).$$

Proof. The proof uses a technique from [40]. As a thought experiment, consider an environment where agent
valuations are correlated as follows: the joint distribution of agent valuations can be sampled by choosing a
set $S'$ of $k$ players uniformly at random, then for each agent in $S'$ sampling from the conditional distribution
$F(x) \mid x \ge S^{-1}(k/n)$, and for each agent not in $S'$ sampling from the conditional distribution $F(x) \mid x < S^{-1}(k/n)$.
Observe that each agent's valuation is distributed according to $F$, yet at any point exactly $k$ players have
value exceeding $S^{-1}(k/n)$.

Let $T'$ be the set of players in this correlated environment whose valuation exceeds $p$. The probability
of a particular agent being included in $T'$ is $S(p)$, and $E[|T'|] = nS(p)$. Since $p \ge S^{-1}(k/n)$, it is clear
that $T' \subseteq S'$ and therefore $0 \le |T'| \le k$.

Now consider our original environment where each agent's valuation is drawn i.i.d. from $F$. Let $T$ be
the set of players in this environment whose valuations exceed $p$. The probability of an agent being included
in $T$ is $S(p)$, the same as the probability of being included in $T'$. However, each agent is included in $T$
independently with probability $S(p)$. As a result, some of the players in $T$ do not win an item; this happens
when $|T| > k$. We can write the revenue of $\mathcal{A}_n^k(p)$ in this i.i.d. environment as follows.

$$\mathrm{Rev}(\mathcal{A}_n^k(p)) = p\, E[\min(|T|, k)]. \qquad (26)$$

Now, observe that $r(Y) = \min(|Y|, k)$ is the rank function of the $k$-uniform matroid. Moreover, it was
shown in [40] that the correlation gap of this function is $\beta = 1 - \frac{1}{\sqrt{2\pi k}}$. Therefore, since each agent is
included in $T$ independently, we know by the definition of the correlation gap and the fact that $T$ and $T'$
have the same marginals that

$$E[r(T)] \ge \beta\, E[r(T')]. \qquad (27)$$

Recall that $|T'|$ is always bounded between $0$ and $k$, therefore $r(T') = |T'|$. Combining (26) and (27), we get

$$\mathrm{Rev}(\mathcal{A}_n^k(p)) = p\, E[\min(|T|, k)] \ge \beta\, p\, E[|T'|] = \beta\, p\, n S(p) = \beta\, \mathrm{Rev}(\mathcal{A}_n^\infty(p)). \qquad \square$$
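The inequality of Claim A.3 can be sanity-checked numerically. In the i.i.d. environment $|T|$ is Binomial$(n, S(p))$, so $E[\min(|T|, k)]$ can be computed exactly and compared with $\left(1 - \frac{1}{\sqrt{2\pi k}}\right) nS(p)$; the price $p$ multiplies both sides and cancels. The sketch below is only an illustration: the function names and the tested values of $n$, $k$ and $s = S(p)$ are assumptions, and the check presumes $s \le k/n$, i.e. $p \ge S^{-1}(k/n)$.

```python
from math import comb, pi, sqrt


def expected_capped_sales(n, k, s):
    """Exact E[min(|T|, k)] when |T| ~ Binomial(n, s)."""
    return sum(
        min(j, k) * comb(n, j) * s**j * (1 - s) ** (n - j)
        for j in range(n + 1)
    )


def multiplicative_bound_holds(n, k, s):
    """Check E[min(|T|, k)] >= (1 - 1/sqrt(2*pi*k)) * n * s, assuming s <= k/n."""
    beta = 1 - 1 / sqrt(2 * pi * k)
    return expected_capped_sales(n, k, s) >= beta * n * s


# Hypothetical parameter choices, each with s <= k/n.
for n, k, s in [(100, 10, 0.1), (100, 10, 0.05), (50, 5, 0.1), (200, 1, 0.005)]:
    print(n, k, s, multiplicative_bound_holds(n, k, s))
```

The case $s = k/n$ is the tightest one, since the cap at $k$ binds most often there.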

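Corollary (Lemma 7.3). Assume the demand distribution is regular. Let $p' \le p$ be two prices such that
$p \ge S^{-1}(k/n)$. Let $n' \le n$. Then $\mathrm{Rev}(\mathcal{A}_{n'}^k(p')) \ge \frac{n'}{n}\,\frac{p'}{p}\left(1 - \frac{1}{\sqrt{2\pi k}}\right) \mathrm{Rev}(\mathcal{A}_n^\infty(p))$.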

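Proof. Observe that $\mathcal{A}_n^k(p')$ sells at least as many items as $\mathcal{A}_n^k(p)$ for every realization of the bids, but at
price $p'$ instead of $p$. Therefore $\mathrm{Rev}(\mathcal{A}_n^k(p')) \ge \frac{p'}{p}\,\mathrm{Rev}(\mathcal{A}_n^k(p))$. Combining with Claim A.3 we get that

$$\mathrm{Rev}(\mathcal{A}_n^k(p')) \ge \frac{p'}{p}\left(1 - \frac{1}{\sqrt{2\pi k}}\right) \mathrm{Rev}(\mathcal{A}_n^\infty(p)).$$

Next, a simple (omitted) argument shows that the revenue $\mathrm{Rev}(\mathcal{A}_n^k(p))$ of a fixed-price auction exhibits
diminishing marginal returns in the number $n$ of players. Therefore, $\mathrm{Rev}(\mathcal{A}_{n'}^k(p)) \ge \frac{n'}{n}\,\mathrm{Rev}(\mathcal{A}_n^k(p))$ for every
price $p$; applying this at price $p'$ completes the proof. □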

Let us note in passing that Claim A.3 and Claim A.2 imply a stronger, multiplicative version of Lemma 3.1,
which is also immediate from [40].



Lemma A.4 (Yan [40]). Assume that the demand distribution is regular. Then there exists a fixed-price
strategy whose expected revenue approximates the offline benchmark up to a factor $1 - \frac{1}{\sqrt{2\pi k}}$.

Appendix B: Monotone Hazard Rate distributions 

Let us state and prove several properties of Monotone Hazard Rate (MHR) distributions which we use in
Section 5 and Section 2. Throughout, for a distribution $F$ we use $F(x)$ to denote the c.d.f., $S(x) = 1 - F(x)$
to denote the survival rate, and $f(x)$ to denote the p.d.f.

We begin with a simple known characterization of MHR distributions. 

Fact B.1. A distribution is MHR if and only if $S(\cdot)$ is log-concave (i.e. $\log S(x)$ is a concave function of $x$).

Next, we bound the survival probability at the Myerson reserve price. 

Claim B.2. Let $F$ be an MHR distribution with support on $[0, \infty]$, and let $S(x) = 1 - F(x)$. Let
$r \in \arg\max R(\cdot)$ where $R(x) = xS(x)$. Then $S(r) \ge 1/e$.

Proof. We have $R'(r) = S(r) + rS'(r) = 0$. Moreover, by Fact B.1 we deduce that

$$\frac{\log S(r)}{r} \ge \frac{d}{dx} \log(S(x)) \Big|_{x=r} = \frac{S'(r)}{S(r)}.$$

Combining with the previous equality, we have $-\frac{1}{r} \le \frac{\log S(r)}{r}$, which is equivalent to $S(r) \ge \frac{1}{e}$. □
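As a quick numerical illustration of the bound $S(r) \ge 1/e$ at a revenue-maximizing price $r$, the sketch below locates $r$ by grid search for two MHR distributions: the exponential distribution, for which the bound holds with equality at $r = 1$, and a Weibull distribution with shape parameter 2. Both distributions, the grid resolution, and the helper names are assumptions made for this example only.

```python
import math


def reserve_price(survival, upper=10.0, steps=100_000):
    """Grid-search approximation of argmax_x x * S(x) on (0, upper]."""
    grid = (upper * i / steps for i in range(1, steps + 1))
    return max(grid, key=lambda x: x * survival(x))


def exp_survival(x):
    """Exponential distribution with rate 1: S(x) = exp(-x), constant hazard rate."""
    return math.exp(-x)


def weibull_survival(x):
    """Weibull distribution with shape 2: S(x) = exp(-x^2), increasing hazard rate."""
    return math.exp(-x * x)


r_exp = reserve_price(exp_survival)
r_wei = reserve_price(weibull_survival)

# The exponential case is tight (S(r) = 1/e); the Weibull case has slack.
print(r_exp, exp_survival(r_exp))
print(r_wei, weibull_survival(r_wei))
```

The exponential distribution has a constant hazard rate, so $\log S$ is linear and the concavity step in the proof holds with equality.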
We now use log-concavity to bound the sensitivity of the inverse of the survival function.
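Claim B.3. Let $F$ be an MHR distribution with support on $[0, \infty]$, and let $\alpha, \beta \in [0, 1]$ with $\beta \ge \alpha$. Then

$$\frac{S^{-1}(\beta)}{S^{-1}(\alpha)} \ge \frac{\log(\beta)}{\log(\alpha)}.$$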

Proof. By Fact B.1, $\log(S(x))$ is a concave, decreasing function of $x$ with $\log(S(0)) = 0$ and
$\lim_{x \to \infty} \log(S(x)) = -\infty$. By Jensen's inequality, this gives for every $a, b \in [0, \infty]$ with $b \le a$

$$\frac{\log(S(b))}{\log(S(a))} \le \frac{b}{a}.$$

Let $a = S^{-1}(\alpha)$ and $b = S^{-1}(\beta)$. Plugging into the above inequality completes the proof. □
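Substituting $a = S^{-1}(\alpha)$ and $b = S^{-1}(\beta)$ into the inequality $\log(S(b)) / \log(S(a)) \le b/a$ gives $S^{-1}(\beta) / S^{-1}(\alpha) \ge \log(\beta) / \log(\alpha)$ for $\beta \ge \alpha$. The sketch below checks this form numerically for two MHR examples with closed-form inverses. The distributions, the tolerance, and the function names are assumptions made for illustration, and the check is restricted to $0 < \alpha \le \beta < 1$ so that all logarithms and ratios are well defined.

```python
import math


def inv_survival_exponential(q):
    """S(x) = exp(-x), so S^{-1}(q) = -log(q)."""
    return -math.log(q)


def inv_survival_weibull(q):
    """S(x) = exp(-x^2), so S^{-1}(q) = sqrt(-log(q))."""
    return math.sqrt(-math.log(q))


def sensitivity_bound_holds(alpha, beta, inv_survival, tol=1e-12):
    """Check S^{-1}(beta) / S^{-1}(alpha) >= log(beta) / log(alpha) for alpha <= beta."""
    lhs = inv_survival(beta) / inv_survival(alpha)
    rhs = math.log(beta) / math.log(alpha)
    return lhs >= rhs - tol


# Hypothetical test points with 0 < alpha <= beta < 1.
pairs = [(0.1, 0.2), (0.2, 0.5), (0.05, 0.9), (0.3, 0.3)]
for alpha, beta in pairs:
    print(alpha, beta,
          sensitivity_bound_holds(alpha, beta, inv_survival_exponential),
          sensitivity_bound_holds(alpha, beta, inv_survival_weibull))
```

For the exponential distribution the two sides coincide, which is why a small tolerance is used in the comparison.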